The Text Mining Learning Labs during the second week of the Summer Workshop are guided by a recent publication by Rosenberg et al. (2020) and will focus on analyzing tweets about the Next Generation Science Standards (NGSS) and Common Core State Standards (CCSS). We’ll dive deeper into these tweets in Text Mining (TM) Module 1: Public Sentiment and the State Standards. For now, this supplemental learning lab is designed to help you understand how the data used in the study and the learning labs was collected. More importantly, though, this supplemental lab will get you up and running with the Twitter API in case you are interested in using data from Twitter in your own research or evaluation work.
As noted in our Getting Started activity, R uses “packages,” add-ons that enhance its functionality. One package that we’ll be using extensively is {tidyverse}. The {tidyverse} package is actually a collection of R packages designed for reading, wrangling, and exploring data, all of which share an underlying design philosophy, grammar, and data structures.
Click the green arrow in the right corner of the “code chunk” that follows to load the {tidyverse} library.
library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy dbplyr
  print.tbl_sql  dbplyr
── Attaching packages ────────────────────────────────────────────────────────────── tidyverse 1.3.1 ──
✓ tibble 3.1.2 ✓ stringr 1.4.0
✓ purrr 0.3.4 ✓ forcats 0.5.1
── Conflicts ───────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x purrr::flatten() masks rtweet::flatten()
x dplyr::lag() masks stats::lag()
Again, don’t worry if you saw a number of messages: those probably mean that the tidyverse loaded just fine. Any conflicts you may have seen mean that functions in the packages you loaded share names with functions in other packages, and R will default to the function from the most recently loaded package unless you specify otherwise.
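For example, if you ever need a masked function, you can call either version unambiguously with the package prefix:

```r
# dplyr::filter() masks stats::filter() once the tidyverse is loaded;
# the :: prefix makes the intended function explicit
dplyr::filter(mtcars, cyl == 6)  # row filtering from dplyr
stats::filter(1:10, rep(1, 3))   # moving-window filter from base R's stats
```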
The {rtweet} package provides users with a range of functions designed to extract data from Twitter’s REST and streaming APIs and has three main goals:
Formulate and send requests to Twitter’s REST and stream APIs.
Retrieve and iterate over returned data.
Wrangle data into tidy structures.
Let’s load the {rtweet} package that we’ll be using to accomplish all three of the goals listed above:
library(rtweet)
Since one of our goals for TM Module 1 is a simplified replication of the study by Rosenberg et al. (2020), let’s begin by introducing the search_tweets() function to mine some tweets about the Next Generation Science Standards.
Use the code chunk below to run the following code to request from Twitter’s API 5,000 tweets containing the NGSSchat hashtag and store as a new data frame called ngss_tweets_q1:
ngss_tweets_q1 <- search_tweets(q = "#NGSSchat",
n = 5000)
Note that the first argument the search_tweets() function expects, q =, is the search term included in quotation marks and that n = specifies the maximum number of tweets we want.
Use the code chunk below to view your new ngss_tweets_q1 data frame using one of the methods previously introduced to help answer the following questions:
# your code here
How many tweets did our first query using the Twitter API actually return? How many variables?
Why do you think our query pulled in far fewer than the 5,000 tweets requested? Hint.
How many tweets are returned if you don’t include the n = argument?
Does our query also include retweets? How do you know?
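If you’re not sure where to begin, here is one possible sketch (an assumption: these use the ngss_tweets_q1 data frame created above, and the is_retweet column returned by rtweet versions prior to 1.0):

```r
dim(ngss_tweets_q1)             # number of tweets (rows) and variables (columns)
glimpse(ngss_tweets_q1)         # compact overview of every column

# One way to spot retweets: older rtweet versions include an is_retweet column
sum(ngss_tweets_q1$is_retweet)  # how many of the returned tweets are retweets
```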
Does capitalization in your query matter? Use the code chunk below to find out.
# your code here
In Understanding public sentiment about educational reforms: The Next Generation Science Standards on Twitter, Rosenberg et al. (2020) accessed tweets and user information from the hashtag-based #NGSSchat online community, including all tweets that used the following phrases: “ngss,” “next generation science standard/s,” “next gen science standard/s.” Note that “/” indicates an additional phrase featuring the respective plural form.
Let’s modify our query using the OR operator to also include “ngss” so it will return tweets containing either #NGSSchat or “ngss” and assign the results to ngss_tweets_q2:
ngss_tweets_q2 <- search_tweets(q = "#NGSSchat OR ngss",
n = 5000)
ngss_tweets_q2
In the following code chunk, try including both search terms but excluding the OR operator to answer the questions below:
# your code here
Does excluding the OR operator return more tweets, the same number of tweets, or fewer tweets? Why?
Does our query also include tweets containing the #ngss hashtag?
What other useful arguments does the search_tweets() function contain? Try adding one and see what happens.
Hint: Use the ?search_tweets help function to learn more about the q argument and other arguments for composing search queries.
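If you’d like a starting point for the first question, here is a sketch (not run here, since it requires Twitter API access): Twitter’s search syntax treats a space between terms as AND, so both terms must appear in a matching tweet.

```r
# A space between terms acts as AND in Twitter's search syntax,
# so this matches only tweets containing BOTH "#NGSSchat" and "ngss"
ngss_and_tweets <- search_tweets(q = "#NGSSchat ngss", n = 5000)
nrow(ngss_and_tweets)  # compare with the count returned by the OR query
```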
Unfortunately, the OR operator will only get us so far. In order to include the additional search terms, we will need to use the c() function to combine our search terms into a single vector.
The {rtweet} package has an additional search_tweets2() function for using multiple queries in a search. To search for an exact phrase, either wrap single quotes around a query that uses double quotes, e.g., q = '"next gen science standard"', or escape each internal double quote with a backslash, e.g., q = "\"next gen science standard\"".
Copy and paste the following code to store the results of our query in ngss_tweets_q3:
ngss_tweets_q3 <- search_tweets2(q = c("#NGSSchat OR ngss",
'"next generation science standard"'),
n = 5000)
Notice the unique syntax required for the query argument. When “OR” is entered between search terms, e.g., query = "#NGSSchat OR ngss", Twitter’s REST API should return any tweet that contains either “#NGSSchat” or “ngss.” It is also possible to search for exact phrases using double quotes. To do this, either wrap single quotes around a search query using double quotes, e.g., q = '"next generation science standard"' as we did above, or escape each internal double quote with a single backslash, e.g., q = "\"next generation science standard\"".
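The two quoting styles are just different ways of writing the same string in R, which you can confirm without calling the API:

```r
q1 <- '"next generation science standard"'    # single quotes wrapping double quotes
q2 <- "\"next generation science standard\""  # escaped internal double quotes
identical(q1, q2)
#> [1] TRUE
```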
We still have a few queries to add in order to replicate the approach by Rosenberg et al., but dealing with that many queries inside a single function call is a bit tedious.
Let’s go ahead and create our very first “dictionary” — we’ll learn more about dictionary-based approaches to text mining in Learning Lab 3 — for identifying tweets related to the NGSS standards, and then pass that dictionary to the q = query argument to pull related tweets:
To do so, we’ll need to add some additional search terms to our list. Run the following code to store your dictionary and queried tweets in your environment:
ngss_dictionary <- c("#NGSSchat OR ngss",
'"next generation science standard"',
'"next generation science standards"',
'"next gen science standard"',
'"next gen science standards"')
ngss_tweets_q4 <- search_tweets2(ngss_dictionary,
n = 5000)
In the code chunk below, write a new query based on a STEM area of interest.
Assign your search to a new object called my_tweets or something appropriate.
Output your new dataset using the datatable() function from the DT package and take a quick look.
# your code here
To learn more about constructing search terms using the query argument, enter ?search_tweets in your console and review the documentation for the q= argument.
For your own research, you may be interested in exploring posts by specific users rather than topics, key words, or hashtags. Yes, there is a function for that too!
For example, let’s create another list containing the usernames of the LASER Institute leads using the c() function again and use the get_timelines() function to get the most recent tweets from each of those users:
laser_peeps <- c("sbkellogg", "jrosenberg6432", "yanecnu", "robmoore3", "hollylynnester")
laser_tweets <- laser_peeps %>%
get_timelines(include_rts=FALSE)
Notice that you can use the pipe operator with the rtweet functions just like you would other functions from the tidyverse.
And let’s use the sample_n() function from the dplyr package to pick 10 random tweets and use select() to view just the screen_name and text columns, which contain the user and the content of their post:
sample_n(laser_tweets, 10) %>%
select(screen_name, text)
The {rtweet} package also has a handy ts_plot() function for taking a very quick look at how far back our data set goes:
ts_plot(ngss_tweets_q4, by = "days")
Notice that this effectively creates a {ggplot2} time series plot for us. I’ve included the by = argument, which by default is set to “days.” It looks like tweets go back about 9 days, which is roughly the search window Twitter’s standard search API makes available.
Use the code chunk below to change the time series plot above from “days” to “hours” and see what happens.
# your code here
Congrats! You made it to the end of the rtweet & the Twitter API supplemental learning lab. Knit your document and check to see if you encounter any errors.
Try one of the following search functions from the rtweet vignette:
get_timelines(): Get the most recent 3,200 tweets from users.
stream_tweets(): Randomly sample (approximately 1%) from the live stream of all tweets.
get_friends(): Retrieve a list of all the accounts a user follows.
get_followers(): Retrieve a list of the accounts following a user.
get_favorites(): Get the most recently favorited statuses by a user.
get_trends(): Discover what’s currently trending in a city.
search_users(): Search for 1,000 users with the specific hashtag in their profile bios.
We’ve only scratched the surface of the functions available in the {rtweet} package for searching Twitter. To learn more about the rtweet package, you can find the full documentation on CRAN at: https://cran.r-project.org/web/packages/rtweet/rtweet.pdf
Or use the following function to access the package vignette:
vignette("intro", package="rtweet")
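For example, here is a minimal sketch of trying get_trends() from the list above (this assumes you are authenticated with the Twitter API, and “chicago” is just an illustrative location):

```r
# Discover what's currently trending in a city
chicago_trends <- get_trends("chicago")
head(chicago_trends$trend)
```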